Several variations on algorithms for dynamic time warping for 
speech processing applications have been proposed. This paper com- 
pares two of these algorithms, the fixed-range method and the local 
minimum method. We show that, based on results from some simple 
word spotting and connected word recognition experiments, the local 
minimum method performs considerably better than the fixed-range 
method. We describe explanations of this behavior and techniques 
for optimizing the parameters of the local minimum algorithm for 
both word spotting and connected word recognition. 

I. INTRODUCTION 

Time registration of a test and a reference pattern is one of the 
fundamental problems in the area of automatic speech recognition. 
This problem is important because the time scales of a test and a 
reference pattern are not perfectly aligned. In some cases the time 
scales can be registered by a simple linear compression or expansion 1,2 ; 
however, in most cases, a nonlinear time warping is required to 
compensate for local compression or expansion of the time scale. For 
such cases, the class of algorithms known as dynamic time warping 
(dtw) methods has been developed. Work by Sakoe and Chiba, 3 
Itakura, 4 and White and Neely 2 has shown that dtw algorithms are an 
effective method of time registering patterns in isolated word recog- 
nition systems. Bridle 5 and Christiansen and Rushforth 6 have studied 
the applicability of dtw algorithms to word spotting, and recently, 
Sakoe, 7 Rabiner and Schmidt, 8 and Myers and Rabiner, 9 have success- 
fully applied dynamic time-warping techniques to connected digit 
recognition. A great deal of work has been done in the area of 
performance evaluation of the various dtw algorithms as applied to 
discrete word recognition. 10-12 However, the effects of the dtw 
parameters on the overall performance of the algorithm for either word 
spotting or connected word recognition are not as well understood. 
The purpose of this paper is to discuss several proposed methods of 
applying dtw algorithms to word spotting and connected word recog- 
nition, and to study some of the factors which determine the perform- 
ance of these algorithms. 

The organization of this paper is as follows. In Section II we review 
the basic dynamic programming method of time alignment and show 
how it may be used efficiently in either a word spotting or a connected 
word recognition problem. We describe, in detail, two different dtw 
algorithms for which we have performed extensive evaluations. Section 
III contains a description of the experiments which we performed to 
evaluate the performance of the different dtw algorithms and the 
effects of the parameters associated with them. In Section IV we 
summarize the results of these experiments and draw some general 
conclusions on the use of dtw algorithms for word spotting and 
connected word recognition. 

II. DYNAMIC PROGRAMMING FOR TIME ALIGNMENT 

In this section we first review the basic principles of dtw algorithms 
as applied to discrete word recognition, and then point out some of the 
inherent difficulties involved in applying these algorithms to word 
spotting and connected speech recognition. We then show how it is 
possible to modify the basic dtw idea so that it may be used for both 
connected word recognition and word spotting applications. 

2.1 Dynamic time warping for discrete word recognition 

Fig. 1. Time warping of a reference and a test pattern. 

The problem of time alignment for discrete word recognition is 
illustrated in Fig. 1. A reference pattern, R(n), n = 1, 2, ..., N, 
consisting of a time sequence (i.e., frames) of a multidimensional 
feature vector is to be time registered with a test pattern, T(m), m = 
1, 2, ..., M, which is also represented as a time sequence of a 
multidimensional feature vector. In Fig. 1, for the sake of clarity, both 
R(n) and T(m) are shown as one-dimensional functions. We shall 
assume that both the reference and the test pattern are measured from 
the acoustic waveform of a single word, spoken in isolation, and that 
both the beginning and ending points of the reference and the test 
pattern have been accurately determined. The problem of time alignment 
is to find the path, here parameterized by the function pair (i(k), 
j(k)), which minimizes a given distance metric. A typical distance 
metric* is of the form 



$$D(i(k), j(k)) = \frac{\sum_{k=1}^{K} d(i(k), j(k))\, W(k)}{N(W)} \tag{1}$$



where K is the length of the path, d(i(k), j(k)) is the local distance, or 
dissimilarity, between frame i(k) of the reference pattern and frame 
j(k) of the test pattern, W(k) is a weighting function applied to the 
path, and N(W) is a normalization factor which is based on the 
particular weighting function that is chosen. 

In addition to minimizing the global distance, the time alignment 
path is chosen to have certain desirable properties. One important 
property is the proper time registration of the beginning and ending 
points of the test and reference patterns, i.e., 



$$i(1) = 1, \quad j(1) = 1, \tag{2a}$$

$$i(K) = N, \quad j(K) = M. \tag{2b}$$

* D is shown here as a functional of the path function pair (i(k), j(k)). 



Also, the time alignment path is required to obey certain shape and 
slope constraints. For example, it would not be reasonable to allow a 
path for which a 10 to 1 expansion or compression of the time axis 
occurs. Another consideration is the preservation of time order, i.e., 
the functions i(k) and j(k) must both be monotonically increasing. 

These local continuity constraints are generally described by speci- 
fying the full path in terms of simple local paths which may be pieced 
together to form larger paths. For example, to reach a grid point (n, m) 
it may be reasonable to have come from any of the grid points (n - 1, 
m - 1), (n - 1, m - 2), or (n - 2, m - 1), as shown in Fig. 2, part a. We 
refer to these constraints as Type I local constraints. Some other 
proposed sets of local constraints are shown in parts b, c, and d of Fig. 
2. The crossed out arc in part d signifies the restriction that a path 
may not move horizontally for two consecutive segments. 4 All these 
local constraints limit the overall slope of the time alignment contour 
to be between 1/2 and 2, in accordance with the results found by Sakoe 
and Chiba. 9 

Fig. 2. Local constraints used for dynamic time warping. 

To solve for the optimal time-alignment path, both the weighting 
function, W(k), and the normalization factor, N(W), must be specified 
in addition to the local constraints. Typically W(k) is chosen to be 
either of two functions, i.e., 

$$W(k) = i(k) - i(k-1) \quad \text{(Type a)}, \tag{3a}$$

$$W(k) = i(k) - i(k-1) + j(k) - j(k-1) \quad \text{(Type b)}. \tag{3b}$$

These two weighting functions are referred to as the asymmetric 
weighting function, Type a, and the symmetric weighting function, 
Type b, and were originally proposed by Sakoe and Chiba. 3 Weighting 
function Type a weights all frames of the reference pattern equally, 
while weighting function Type b weights all frames of both the 
reference and the test equally. For initialization purposes, i(0) and j(0) 
are defined to be 0 and thus W(1) = 1 for weighting function Type a and 
W(1) = 2 for weighting function Type b. 

The choice of N(W) is typically made such that D(i(k), j(k)) is the 
average local distance along the path defined by i(k) and j(k), and is 
independent of both the lengths of the reference and test patterns, as 
well as the length of the time alignment path itself. The natural choice 
for N(W) is thus 

$$N(W) = \sum_{k=1}^{K} W(k). \tag{4}$$

For weighting functions Types a and b the normalization is given by 

$$N(W_a) = \sum_{k=1}^{K} \bigl(i(k) - i(k-1)\bigr) = i(K) - i(0) = N, \tag{5a}$$

$$N(W_b) = \sum_{k=1}^{K} \bigl(i(k) - i(k-1) + j(k) - j(k-1)\bigr) = i(K) - i(0) + j(K) - j(0) = N + M. \tag{5b}$$

Given a weighting function and a set of local constraints it is possible 
to define the optimal time-alignment path as that path which minimizes 
the total distance D(i(k), j(k)). More formally, if we denote the 
distance associated with the optimal path as D, then 

$$D = \min_{K,\, i(k),\, j(k)} \bigl[D(i(k), j(k))\bigr]. \tag{6}$$

The solution to this problem may be found by dynamic programming 
by use of the following optimality principle: 

Local Optimality: If the best path from the grid point (1, 1) to the 
grid point (n, m) goes through a grid point (n', m'), then the best path 
from the grid point (1, 1) to the grid point (n, m) includes, as a portion 
of it, the best path from the grid point (1, 1) to the grid point (n',m'). 
Thus, if we define D_A(n, m) as the minimum total distance along 
any path from the grid point (1, 1) to the grid point (n, m), then D_A(n, 
m) can be computed recursively, according to the optimality principle, 
as 

$$D_A(n, m) = \min_{n', m'} \bigl[D_A(n', m') + \hat{d}((n', m'), (n, m))\bigr], \tag{7}$$

where $\hat{d}((n', m'), (n, m))$ is the weighted distance from the grid point 
(n', m') to the grid point (n, m). For example, for Type I local 
constraints and an asymmetric weighting function, n' and m' may take 
on any of the following values, 

$$(n', m') \in \{(n-1, m-1),\ (n-1, m-2),\ (n-2, m-1)\} \tag{8}$$

and $\hat{d}((n', m'), (n, m))$ is given by 

$$\hat{d}((n-1, m-1), (n, m)) = d(n, m), \tag{9a}$$

$$\hat{d}((n-1, m-2), (n, m)) = d(n, m), \tag{9b}$$

$$\hat{d}((n-2, m-1), (n, m)) = 2d(n, m). \tag{9c}$$

Thus the full dtw recursion for Type I local constraints and weighting 
function Type a is given by 

$$D_A(n, m) = \min\bigl[D_A(n-1, m-1) + d(n, m),\ D_A(n-1, m-2) + d(n, m),\ D_A(n-2, m-1) + 2d(n, m)\bigr]. \tag{10}$$

Using the local optimality principle, a complete dtw algorithm is given 
by the following steps: 

Step 1. Initialize D_A(1, 1) = d(1, 1) W(1). 

Step 2. Compute D_A(n, m) recursively for 1 ≤ n ≤ N, 1 ≤ m ≤ M. 

Step 3. D = D_A(N, M)/N(W). 
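
The three steps above can be sketched in Python for Type I local constraints and the Type a (asymmetric) weighting. This is only an illustrative sketch: the function name `dtw_type1`, the 0-based indexing, and the precomputed local-distance matrix `d` are assumptions made for the example and do not come from the paper.

```python
import math

def dtw_type1(d):
    """Average distance along the best path through the matrix d.

    d[n][m] is the local distance between reference frame n and test
    frame m (0-based here; the text uses 1-based indices).
    Returns math.inf when no legal path joins the two corners.
    """
    N, M = len(d), len(d[0])
    INF = math.inf
    D = [[INF] * M for _ in range(N)]
    D[0][0] = d[0][0]                      # Step 1, with W(1) = 1 (Type a)
    for n in range(1, N):                  # Step 2, recursion of eq. (10)
        for m in range(1, M):
            best = D[n - 1][m - 1] + d[n][m]
            if m >= 2:
                best = min(best, D[n - 1][m - 2] + d[n][m])
            if n >= 2:
                best = min(best, D[n - 2][m - 1] + 2 * d[n][m])
            D[n][m] = best
    return D[N - 1][M - 1] / N             # Step 3, with N(W_a) = N
```

Because a step that skips a reference frame is charged twice the local distance, every complete path carries a total weight of N, which is why the final division by N yields an average local distance.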

This completes our review of the basic principles involved in apply- 
ing dynamic programming to discrete word recognition. We will now 
describe the difficulties which arise when dtw algorithms are applied 
to connected word recognition problems and then we will show how 
the dtw principle can be modified for word spotting and connected 
word recognition applications. 

2.2 Difficulties in connected word recognition 

We shall assume that we are given a test pattern consisting of a 
sequence of connected words, spoken in a normal manner, for which 
the global beginning and ending points have been accurately located 
and for which no further segmentation has been attempted. Given 
such a framework, the word spotting problem is to determine all 
subsections of the test pattern, if any, which match with a specified 
reference pattern, called the keyword. Thus, for word spotting a 
multiplicity of regions of the test pattern must be compared with the 
keyword pattern. 

The connected word recognition problem, on the other hand, is to 
piece together reference patterns (obtained, in all our work, from 
isolated occurrences of words) to match the test pattern. The general 
approach to this problem will be the one proposed by Levinson and 
Rosenberg, 13 namely: 
(i) Find the reference pattern that best fits a given section of the 
test pattern. 
(ii) Use the position within the test pattern at which the best matching 
word ends to postulate the beginning of the following word. 
(iii) Continue to concatenate reference patterns in this manner until 
the test pattern is exhausted. 
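
Steps (i) to (iii) amount to a greedy left-to-right loop, which might be sketched as follows. The function `recognize_string` and the `match` callback are hypothetical names introduced for illustration; `match` stands in for whatever time-warping scorer is used and is assumed to return an average distance and an ending frame.

```python
import math

def recognize_string(test_len, templates, match):
    """Greedy left-to-right concatenation of reference patterns.

    templates: dict mapping a word label to its reference pattern.
    match(pattern, start): hypothetical scorer returning
        (average distance, ending frame) for a pattern begun at `start`.
    """
    words, start = [], 0
    while start < test_len:
        # (i) find the reference pattern that best fits this section
        label, (dist, end) = min(
            ((w, match(p, start)) for w, p in templates.items()),
            key=lambda item: item[1][0],
        )
        if math.isinf(dist) or end < start:
            break                      # nothing fits; stop rather than loop
        words.append(label)
        start = end + 1                # (ii) postulate the next beginning
    return words                       # (iii) until the test is exhausted
```

In the approach described later in the text the next word would begin in a region around `end` rather than at the single frame `end + 1`; the sketch keeps a single frame only for brevity.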
Dynamic time-warping algorithms, as they have been applied to 
discrete word recognition applications, are not directly applicable to 
either the word spotting or the connected word recognition problem. 
There are two reasons why this is so. Figure 3 illustrates some of the 
problems which are encountered. In this figure we show the time 
pattern for log intensity of two speech utterances, "3," "8" in part a, 
and "38" in part b. The utterance in part a was spoken with a 
discernible pause between the "3" and the "8," while the utterance in 
part b was spoken with no discernible pause between the "3" and the 
"8." Dynamic time-warping algorithms, as they have been applied to 
discrete word recognition, require a reliable set of word boundaries. 
However, as seen in Fig. 3b, a reliable segmentation for the utterance 
"38" is difficult, if not impossible, to obtain. 

Fig. 3. Log energy for two speech utterances. 

Another difficulty in using dtw algorithms, based on isolated word 
reference templates, for connected speech applications is the problem 
of coarticulation between words. For example, the final /i/ of the word 
"3" and the initial /e'/ of the word "8" coarticulate strongly with each 
other. Thus, another fundamental assumption that has been relied on, 
namely that the characteristics of the isolated reference words which 
we are trying to match to our test utterance can be truly found in the 
test pattern, is not valid. In the next section we will describe the basic 
techniques that will be used to overcome these difficulties. 

2.3 Basic approaches to connected speech recognition problems 

In our approach to connected word recognition and word spotting 
we will make two changes from the structure of the isolated word dtw 
algorithm. One change is to no longer attempt to find the entire 
isolated reference pattern in the test pattern. We will still use isolated 
words as our reference patterns but will only expect a good match in 
the middle of the word, and not necessarily near the ends. Thus, we 
will not require that we be able to accurately match the beginning and 
ending points of the reference pattern to points within the test pattern. 
As a result, we would like to consider the possibility of overlapping 
reference patterns to recognize connected speech. In this manner we 
hope to account for both errors in the endpoint locations and for some 
of the gross features of coarticulation. 

Another fundamental modification to the basic dtw algorithm is the 
use of beginning and ending regions rather than beginning and ending 
frames. In this manner we hope to avoid some of the problems inherent 
in requiring an accurate segmentation of the test utterance. Figure 4 
defines, within a test pattern, a beginning region of size B (frames), 
with potential starting frames between b_1 and b_2 (B = b_2 - b_1 + 1), and 
an ending region of size E, with potential ending frames between e_1 
and e_2 (E = e_2 - e_1 + 1). One possible dtw constraint would be that 
the best time-alignment contour may begin anywhere within the 
beginning region and end anywhere within the ending region. Three 
such potential paths are shown in Fig. 4. Such a framework would be 
used for word spotting, in which the beginning and ending regions 
correspond to the entire test pattern, or for connected word recognition, 
in which the ending region for one word is used to hypothesize 
the beginning region for the next word. 

Fig. 4. Illustration of the use of beginning and ending regions. 

The use of beginning and ending regions modifies the basic dtw 
algorithm by changing the constraints which are imposed on the ends 
of the time-alignment contour, i.e., 

$$i(1) = 1, \quad j(1) = b, \quad b_1 \le b \le b_2, \tag{11a}$$

$$i(K) = N, \quad j(K) = e, \quad e_1 \le e \le e_2. \tag{11b}$$



Thus, to find the optimal time-alignment contour, every possible 
beginning and ending point pair must be tried, that is, 



$$D = \min_{b_1 \le b \le b_2}\ \min_{e_1 \le e \le e_2}\ \min_{K,\, i(k),\, j(k)} \bigl[D(i(k), j(k)) \ \text{s.t.}\ j(1) = b,\ j(K) = e\bigr]. \tag{12}$$



The amount of computation required to solve eq. (12) for the optimal 
path can be excessive, i.e., theoretically we require B · E separate time 
warps in the most general case. However, the amount of computation 
required to solve eq. (12) may be reduced to a single time warp by 
judicious selection of the weighting function. If W(k) is chosen to be 
the asymmetric weighting function, Type a (W_a(k) = i(k) - i(k - 1)), 
and N(W) is chosen appropriately (N(W_a) = N), then D may be 
computed efficiently by a modified dtw algorithm as follows: 

Step 1. Set D_A(1, b) = d(1, b) for b_1 ≤ b ≤ b_2. 

Step 2. Compute D_A(n, m) recursively for 1 ≤ n ≤ N, b_1 ≤ m ≤ e_2. 

Step 3. D = (1/N) min [D_A(N, e)], with the minimum taken over e_1 ≤ e ≤ e_2. 

This algorithm works because Step 1 initializes all possible beginning 
points, Step 2 computes the best path to a point (n, m) from any of the 
potential beginning points initialized in Step 1, and Step 3 finds the 
best possible ending point along any path from any possible beginning 
point. The particular choice of the asymmetric weighting function is 
important because its normalization factor is unaffected by the choice 
of the beginning or ending points, i.e., its normalization factor is always 
N. A dependence on the length of the test pattern, as in the symmetric 
weighting function, Type b, would require a separate time warp for 
each set of beginning and ending points because the effective length of 
the test (e — b + 1) depends on the choice of the beginning and ending 
points. 
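
The modified algorithm can be sketched as a single pass in Python, again for Type I local constraints and Type a weighting. The name `dtw_open_ends`, the 0-based indices, and the choice to let every test frame serve as a candidate ending frame (the word spotting case) are assumptions made for this example.

```python
import math

def dtw_open_ends(d, b1, b2):
    """Return (best average distance, best ending frame).

    d[n][m]: local distance, reference frame n vs test frame m (0-based).
    b1..b2: inclusive range of allowed starting frames in the test.
    Every test frame is treated as a possible ending frame.
    """
    N, M = len(d), len(d[0])
    INF = math.inf
    D = [[INF] * M for _ in range(N)]
    for b in range(b1, b2 + 1):            # Step 1: all beginning points
        D[0][b] = d[0][b]
    for n in range(1, N):                  # Step 2: one shared recursion
        for m in range(b1 + 1, M):
            best = D[n - 1][m - 1]
            if m >= 2:
                best = min(best, D[n - 1][m - 2])
            best += d[n][m]
            if n >= 2:
                best = min(best, D[n - 2][m - 1] + 2 * d[n][m])
            D[n][m] = best
    e = min(range(M), key=lambda m: D[N - 1][m])   # Step 3: best ending
    return D[N - 1][e] / N, e
```

The division by N in the last line is valid for every starting and ending frame at once, which is the property of the asymmetric weighting that the paragraph above relies on.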

An important factor, even with the savings of a single time warp, is 
the large amount of computation required for the dtw algorithm. Step 
2 of the modified dtw algorithm is defined for 1 ≤ n ≤ N, b_1 ≤ m ≤ e_2 
and this region may be as large as N · M. It is also not possible to 
significantly reduce this size by using restrictions on the slope of the 
warping contour when the ending region is left unspecified. This point 
is illustrated in Fig. 5, where the slope of the warping function 
is restricted to be between 1/2 and 2. We observe that, even with 
this restriction, when no ending region is specified, the area for which 
D_A(n, m) must be computed is (3/4)N^2 + B · N. 
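
A brief check of this area, under the assumption that the grid may be treated as continuous: with slopes between 1/2 and 2, the admissible test frames at reference frame n extend from roughly b_1 + n/2 to b_2 + 2n, a strip of height about B + (3/2)n. Summing over the reference frames gives

$$\sum_{n=1}^{N} \Bigl(B + \tfrac{3}{2}\,n\Bigr) = B N + \tfrac{3}{4}\,N(N+1) \approx B \cdot N + \tfrac{3}{4}\,N^2 ,$$

so the region still grows quadratically with N when the ending region is left open.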

Fig. 5. Region of the (n, m) plane which is examined in a time warp for which no 
ending region is specified. 

Two modifications to the dtw algorithm have been suggested to 
reduce this amount of computation. In particular, Sakoe and Chiba 3 
have proposed that a time-warping path not be allowed to deviate 
significantly from a straight line, i.e., for any i(k), the value of j(k) is 
restricted such that 

$$|j(k) - i(k) - b + 1| \le R, \tag{13}$$

where b is the center of the beginning region [b = (b_1 + b_2)/2] and R 
is the maximum deviation which is allowed. R must be chosen to at 
least cover the entire beginning region, i.e., 2R + 1 ≥ B. This algorithm 
will be referred to as the fixed range dtw algorithm and is illustrated 
in Fig. 6a. Another range-reduction technique, proposed by Rabiner, 
Rosenberg, and Levinson 10 and described in detail by Rabiner and 
Schmidt, 8 is shown in Fig. 6b. Here j(k) is restricted to be within a 
fixed range about the best path so far, that is, the local minimum. 
Formally, we have 

$$|j(k) - c(k)| \le \epsilon, \tag{14a}$$

$$c(k) = \arg\min_{m} \bigl[D_A(i(k) - 1, m)\bigr], \tag{14b}$$

$$c(1) = b, \tag{14c}$$

where c(k) is the position, in the vertical direction, of the local 
minimum of D_A(i(k) - 1, m), and ε is the allowable range about this 
local minimum. Thus, if D_A(n, m) is computed in successive vertical 
strips, i.e., n is fixed and m is varied, then the range of one vertical 
strip is ±ε about the local minimum of the previous vertical strip. This 
algorithm is referred to as the local minimum dtw algorithm. 

Fig. 6. Illustration of the fixed range and the local minimum dtw algorithms. 

Two fundamental differences exist between these two algorithms. 
The fixed range dtw algorithm, a priori, specifies the ending region 
from the specification of the beginning regions, i.e., 

$$E = 2R + 1, \tag{15a}$$

$$e_1 = b + N - R, \tag{15b}$$

$$e_2 = b + N + R, \tag{15c}$$

while the local minimum dtw algorithm defines the ending region 
implicitly from the local minimum of the last vertical strip, i.e., 

$$E = 2\epsilon + 1, \tag{16a}$$

$$e_1 = c(K) - \epsilon, \tag{16b}$$

$$e_2 = c(K) + \epsilon. \tag{16c}$$
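
The two range restrictions differ only in where each vertical strip is centred, which the following Python sketch tries to make concrete. It assumes Type I local constraints, Type a weighting, and 0-based indices; the function `banded_dtw` and its parameters are names chosen for this example rather than anything defined in the paper.

```python
import math

def banded_dtw(d, b, half_width, local_minimum):
    """Single time warp restricted to a band of +/- half_width frames.

    d[n][m]: local distance (0-based indices).
    b: centre of the beginning region in the test pattern.
    local_minimum=False: band centred on the straight line m = n + b
        (fixed range; half_width plays the role of R).
    local_minimum=True: band centred on the minimum of the previous
        vertical strip (half_width plays the role of epsilon).
    Returns (average distance, ending frame).
    """
    N, M = len(d), len(d[0])
    INF = math.inf
    D = [[INF] * M for _ in range(N)]
    centre = b
    for n in range(N):
        if n > 0:
            if local_minimum:
                prev = D[n - 1]
                centre = min(range(M), key=lambda m: prev[m])
            else:
                centre = b + n
        lo = max(0, centre - half_width)
        hi = min(M - 1, centre + half_width)
        for m in range(lo, hi + 1):
            if n == 0:
                D[0][m] = d[0][m]          # initialise the beginning region
                continue
            best = INF
            if m >= 1:
                best = D[n - 1][m - 1]
            if m >= 2:
                best = min(best, D[n - 1][m - 2])
            best += d[n][m]
            if n >= 2 and m >= 1:
                best = min(best, D[n - 2][m - 1] + 2 * d[n][m])
            D[n][m] = best
    last = D[N - 1]
    e = min(range(M), key=lambda m: last[m])
    return last[e] / N, e
```

On a toy test pattern that is the reference stretched by a factor of two, the moving band can follow the stretched path while a straight-line band of the same width cannot; this is a constructed illustration, not a result from the experiments reported here.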

The other fundamental difference between the two time-warping 
algorithms involves the number of time warps required to cover a 
beginning region. For the fixed range dtw algorithm the entire begin- 
ning region is most efficiently covered in a single time warp with 
2R + 1 = B, rather than several smaller time warps, because overlap- 
ping time warps may be merged together without loss of accuracy. 

However, an analogous specification of the local minimum time-warping 
algorithm (2ε + 1 = B) may not be truly optimal. Since one 
application of the local minimum dtw algorithm may follow only one 
local minimum path, erroneous decisions may be made because the 
true path may be "lost," i.e., the globally best path may not be within 
ε frames of the locally best path. As such, it may be better to try 
several smaller local-minimum time warps, thus allowing several 
different local-minimum paths to be tried, and to compare the results of 
these paths to determine the overall "best" path. Such a procedure is 
illustrated in Fig. 7. We assume that NTRY local minimum time warps 
are to be computed. Each time warp has (about its respective local 
minimum) a local range of ±ε and the centers of two adjacent time 
warps are initially separated by δ. The entire region covered by the 
NTRY time warps is given by 

$$A = 2\epsilon + 1 + (\mathrm{NTRY} - 1) \cdot \delta. \tag{17}$$

To cover the entire beginning region, NTRY, ε, and δ are chosen so 
that A = B. 
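
As a small worked illustration of eq. (17), the helpers below compute the span covered by NTRY warps and list the beginning region of each warp. The function names and the convention that the first region starts at frame `b1` are assumptions made for this sketch.

```python
def covered_span(eps, delta, ntry):
    """Eq. (17): size of the region spanned by the NTRY warps."""
    return 2 * eps + 1 + (ntry - 1) * delta

def warp_regions(b1, eps, delta, ntry):
    """Beginning regions (first frame, last frame) of the NTRY warps.

    The first warp is centred eps frames after b1 so that its region
    starts exactly at b1; successive centres are delta frames apart.
    """
    centres = [b1 + eps + t * delta for t in range(ntry)]
    return [(c - eps, c + eps) for c in centres]
```

For example, ε = 4, δ = 9, and NTRY = 3 give a span of 27 frames made of three adjacent regions, while a single warp needs ε = 13 to span the same 27 frames.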

In the next section of this paper we describe experiments designed 
to measure the relative strengths and weaknesses of the fixed range 
and the local minimum dtw algorithms and also to determine reasonable 
choices for the parameters δ, ε, and NTRY for both word spotting 
and connected word recognition applications. 

Fig. 7. Illustration of the parameters of the local minimum dtw algorithm. 

III. EXPERIMENTS IN DYNAMIC TIME WARPING FOR CONNECTED 
SPEECH RECOGNITION 

This section presents the results of experiments designed to compare 
the fixed range and the local minimum dtw algorithms. We also 
describe the results of several experiments designed to study the 
parameters of the local minimum algorithm. Finally, we show how 
these results may be applied to the problems of word spotting and 
connected word recognition. 

3.1 Comparison of the time warping algorithms 

In our initial experiment the recognition accuracies achieved by 
both the fixed range and the local minimum dtw algorithms for a 
modified isolated word recognition problem are compared. The test 
utterances consisted of 54 words from a vocabulary of computer terms, 
spoken by each of 4 talkers, for a total of 216 utterances. The test 
utterances were recorded over a dialed-up telephone line, band-limited 
to 3.2 kHz, digitized at 6.67 kHz, and analyzed every 15 ms with an 
eighth-order lpc analysis using a 45-ms window (i.e., successive frames 
overlapped by 30 ms). Local distance scores, d(i(k), j(k)), were 
calculated using Itakura's log likelihood ratio. 4 The reference patterns 
consisted of two templates per word of the vocabulary formed by a 
speaker-independent clustering technique. 14* 

To evaluate the relative performance of the two dtw algorithms the 
test utterances were modified so that a beginning region could be 
specified as some range about the true beginning point. No ending 
region was specified. For the sake of comparison, R and ε were both 
set equal to eight frames† and NTRY was set to one. Figure 8 shows 
the recognition results for both algorithms as a function of the four 
different local constraints (used in the dtw algorithms) defined in 
Section 2.1. We observe that the local minimum dtw algorithm per- 
formed better than the fixed range dtw algorithm for all local con- 
straints. 

In another comparison we generated ten pseudo-connected test 
sequences by artificially embedding (at an arbitrary frame) an isolated 
digit into a connected-digit sequence, both uttered by the same talker. 
We then used both dtw algorithms to "spot" the embedded digit using 
two speaker-dependent templates per digit. The parameters of the two 
dtw algorithms that were used were the same ones as in our initial 
experiment (ε = 8, R = 8). To spot the embedded digit, every possible 
beginning region of size 2ε + 1 (= 2R + 1) was tried. The number of 
times that the dtw algorithm found the (correct) best path (as deter- 
mined by the lowest overall distance achieved by any beginning region) 
was recorded. We also recorded the ending point of the embedded 
word, as estimated by the word spotting procedure. Results showed 
that both the local minimum and the fixed range dtw algorithms were 
able to locate the endpoint of the embedded word with a high degree 
of accuracy. (The average error between the true ending frame and 
the estimated ending frame was 1.2 frames for both dtw algorithms.) 

Figure 9 shows the relative performance of the two dtw algorithms 
for this simple word spotting experiment. These figures plot the 
number of times that the particular dtw algorithm found the proper 
path (as determined by the lowest-distance score achieved) for each of 
the ten embedded digits. We observe from Fig. 9 that the local 
minimum dtw algorithm found the best path more often than the 
fixed range dtw algorithm for almost all digits. 

We also observe that the local minimum algorithm was able to find 
the best path 17 times (the maximum number possible, 2ε + 1) for 8 of 
the 10 digits, while the fixed range algorithm never achieved this 
accuracy. 

Fig. 8. Results for word recognition using both the fixed range and the local minimum 
dtw algorithms. 

* The speaker-independent reference template set was a subset of the 12-template-per-word 
set used in Ref. 14. This modification was used to reduce computation (and 
hence reduce accuracy somewhat). For the purpose of our experiments (i.e., the relative 
comparison of the fixed range and the local minimum dtw algorithms) this modification 
was of little consequence. 

† Setting R and ε equal is a fair comparison of the two methods since the computation 
is the same for both methods. 



Fig. 9. Results for word spotting using both the fixed range and the local minimum 
dtw algorithms (Type I local constraints; local minimum with ε = 8; horizontal axis: 
embedded digit). 

The results of these two sample experiments showed that the local 
minimum dtw algorithm performed consistently better than the 
fixed-range dtw algorithm. In the next section we describe experiments 
designed to more fully study some of the parameters of the 
local-minimum time-warping algorithm. 

3.2 Examination of the parameters of the local-minimum dynamic 
time-warping algorithm 

To understand the effects of the various combinations of the 
parameters A, δ, NTRY, and ε on the performance of the local minimum 
dtw algorithm, a series of connected digit-recognition experiments was 
performed. A total of 80 strings of from 2 to 5 connected digits each 
(20 strings of each length) were recorded by each of the two talkers. 
These strings were the same as those used by Rabiner and Schmidt. 8 
In the recognition task we used two speaker-dependent templates per 
digit. The first step in the experiment was to "spot" the ending point 
of the first digit in each string via a local-minimum algorithm (ε = 11, 
NTRY = 1) using the known beginning point of the first digit. Then 
an attempt was made to recognize the second digit in the string. 
Because of inaccuracies in "spotting" the ending point of the first digit, 
and because of coarticulation effects, it was not possible to precisely 
determine the beginning point of the second digit, and, as such, a 
beginning region for the second digit was centered around the ending 
frame of the first digit, as determined by the "spotting" procedure. 
The best candidate for the second digit was chosen as that template 
which achieved the lowest overall average distance, regardless of where 
it ended. Several values of ε, δ, A, and NTRY were used and the 
accuracies and distance scores for the recognition of the second digit 
were recorded. 

Figure 10 shows, for a large value of A (27 in this case), the average 
best distance score for all NTRY time warps as a function of δ, for 
several values of ε. Two curves are shown in each part of the figure. 
The solid curve is the case when the reference word is the same as the 
second word in the test strings. The dashed curve represents the case 
in which the reference is different from the second word in the test 
string. Examination of Fig. 10 shows that the average best distance for 
both "same words" and "different words" increases as δ increases. 
However, we observe that when the reference is different from the 
second digit in the test utterance (i.e., the dashed curves), the average 
distance generally increases as δ increases, but, when the reference and 
the test words are the same (i.e., the solid curves), the average best 
distance is constant for small values of δ and increases only beyond 
the critical value δ = 2ε + 1. This critical value, δ = 2ε + 1 (shown by 
a caret in the scales of Fig. 10), is a particularly important value of δ 
because for δ < 2ε + 1, consecutive time warps overlap in their 
beginning regions, and for δ > 2ε + 1 there are frames between two 
consecutive time warps which are not covered by either beginning 
region. When δ = 2ε + 1, we have the case where there is no overlap 
in adjacent beginning regions and no skipped frames between these 
regions. From the results shown in Fig. 10 we conclude that, on 
average, there is no loss in performance in the local-minimum dtw 
algorithm as long as no potential beginning frames are skipped, i.e., as 
long as δ ≤ 2ε + 1. 
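
The three cases around the critical spacing can be told apart with one subtraction, as in this minimal sketch. The function name `frames_between_regions` is an assumption; it considers two adjacent warps whose beginning regions each span 2ε + 1 frames and whose centres are δ frames apart.

```python
def frames_between_regions(eps, delta):
    """Frames separating two adjacent beginning regions.

    Positive: that many frames are skipped between the two regions.
    Zero: the regions are exactly adjacent (delta = 2*eps + 1).
    Negative: the regions share that many frames (in absolute value).
    """
    return delta - (2 * eps + 1)
```

With ε = 4, for instance, a spacing of δ = 9 leaves no gap and no overlap, δ = 12 skips three frames, and δ = 5 makes the regions share four frames.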

One explanation of why δ may be taken as large as 2ε + 1, i.e., no 
overlapping of beginning regions, without an appreciable loss of 
accuracy, is shown in Fig. 11. Here we show the progress of a set of typical 
paths in which the starting regions overlap. By the nature of the 
local-minimum dtw algorithm, best paths from overlapping time warps tend 
to merge if there is a good path common to both of their beginning 
regions. Figure 12 shows the effects of path merging (of the local 
minimum dtw algorithm) on the digit recognition accuracies. Here we 
plot the recognition error rate for the second digit in the test sequences 
as a function of δ for various values of ε. We see that, for a fixed ε, it 
is possible to increase δ with essentially no loss in accuracy as long as 
δ ≤ 2ε + 1.* 

Fig. 10. Distance scores for the local minimum dtw algorithm as applied to 
connected digit recognition. 

Fig. 11. Illustration of path merging for two adjacent local-minimum time warps. 

Figure 12 also shows that ε = 6 provides the minimum error rate. It 
is reasonable to expect that as ε is made too small, good paths may 
easily become lost; but as ε is made too large, incorrect paths may 
start to generate low scores and thus cause errors. Thus, a finite value 
of ε is probably optimum. Unfortunately, such a value will have to be 
determined for each application. 

* Note that for Δ fixed, the largest possible δ is δ = Δ - 2ε - 1 (NTRY = 2) so that
the curves for the various values of ε in Figure 12 are defined only for those values of δ
such that δ ≤ Δ - 2ε - 1.

Fig. 12. Digit error rate for connected digit recognition using the local minimum dtw
algorithm and several values of ε and δ.

Another interesting effect on recognition accuracy for various
combinations of ε, δ, Δ, and NTRY is shown in Fig. 13. Here we plot
recognition error rates for the second digit of our test utterances for
two cases, namely ε = (Δ - 1)/2 (NTRY = 1), and for the best
combination of ε, δ, and NTRY (as determined by the lowest recognition
error rate). We see that, for smaller values of Δ, a single warp
performs as well as any combination of ε, δ, and NTRY, and as Δ
increases, the difference in error rate between the best possible ε, δ,
and NTRY combination and a single warp remains less than 2.5 percent.
Thus, it might be possible to perform some type of connected word
recognition using only a single local-minimum time warp per word. In
the next section we describe how the results of our experiments have
actually been applied to both word spotting and connected word
recognition applications.
Fig. 13. Digit error rates for connected digit recognition using the local minimum
dtw algorithm.

3.3 Application of DTW algorithms to word spotting and connected word
recognition

We have shown that, both for connected word recognition and word
spotting applications, the local minimum dtw algorithm performs
consistently better than the fixed range dtw algorithm. We have also
shown that, given a value of ε, δ may be chosen as large as δ = 2ε + 1
without significant degradation in the performance of the local
minimum dtw algorithm. Since, for a fixed beginning region (i.e., a
fixed Δ), the number of time warps is given by
NTRY = 1 + (Δ - 2ε - 1)/δ, the best choice for δ is δ = 2ε + 1. This
minimizes the number of time warps which need to be performed. For the
problem of word spotting the obvious choice for Δ is Δ = M, i.e., the
entire length of the test pattern. For this case optimal values of ε
and NTRY must still be determined. In general, the selection of ε and
NTRY depends on several factors. As ε is increased, the chance of a
missed keyword decreases because more paths are examined, but the
chance of a false alarm increases. Also, as ε increases, the value of
NTRY decreases [NTRY = Δ/(2ε + 1) for δ = 2ε + 1], thereby reducing the
amount of computation required. Thus, misses, false alarms, and the
amount of computation must be traded off in the selection of ε and
NTRY for a word spotting application.
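
The relation between Δ, ε, δ, and NTRY can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. The function name `num_warps` is hypothetical, and rounding up when δ does not divide Δ - 2ε - 1 evenly is an assumption made here so that the whole beginning region stays covered; the text gives only the exact-division formula.

```python
import math


def num_warps(big_delta, eps, delta):
    """NTRY = 1 + (big_delta - 2*eps - 1) / delta.

    big_delta: width of the total beginning region, in frames
    eps:       half-width of one warp's beginning region
    delta:     shift, in frames, between successive warps

    Assumption: the quotient is rounded up when it is not an integer.
    """
    remaining = big_delta - (2 * eps + 1)
    if remaining < 0:
        raise ValueError("one warp already spans more than big_delta frames")
    if delta <= 0:
        raise ValueError("delta must be positive")
    return 1 + math.ceil(remaining / delta)


if __name__ == "__main__":
    big_delta, eps = 21, 3
    # Try every shift that skips no frames, i.e. delta <= 2*eps + 1.
    for delta in range(1, 2 * eps + 2):
        print(f"delta={delta}: NTRY={num_warps(big_delta, eps, delta)}")
```

For Δ = 21 and ε = 3, the count falls from 15 warps at δ = 1 to 3 warps at δ = 2ε + 1 = 7, which equals Δ/(2ε + 1). With Δ = 17 and ε = 8 a single warp already spans the region, so NTRY = 1 for any δ.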

In a connected word recognition application, however, we not only
must choose ε and NTRY but must also choose Δ. We have shown
that for Δ < 17 frames, it is possible to do connected digit recognition
using only a single local-minimum time warp per word. However, we
also found that the best recognition accuracy was achieved with Δ =
21 but not with a single local-minimum time warp. Thus, there is an
apparent trade-off between recognition accuracy and speed of
computation. However, work by Rabiner and Schmidt 8 has shown that it
is better not to center the beginning region of one word around the
end of the previous word, as we did, but, rather, to center the beginning
region of one word several frames earlier than the ending region of
the previous word. The reason for this is that the isolated reference
patterns tend to be longer than the spoken connected words, and thus,
the time warps tend to overestimate the ending frame of each word.
We tried a simple experiment in which the beginning region of one
word was centered eight frames earlier in the test pattern than the
end of the previous word. The values of ε, NTRY, and Δ were ε = 8,
NTRY = 1, and Δ = 17. Using these values and the same test
utterances used by Rabiner and Schmidt, 8 i.e., 80 sequences of from 2
to 5 digits each spoken once by each of six talkers, we achieved a string
recognition rate of 429 correct strings out of 480 possible. This may be
compared with a total of 442 correct strings using ε = 8, δ = 3, and
NTRY = 4, as reported by Rabiner and Schmidt. It should be noted,
however, that the system of Rabiner and Schmidt used multiple
candidate strings while our simple experiment did not. When we reran
the system of Rabiner and Schmidt using only a single candidate string
(ε = 8, δ = 3, NTRY = 4) we found only 430 correct strings out of the
480 possible. Thus, with a single local-minimum time warp per word
we achieved results comparable to those achieved by the use of four
local-minimum time warps per word.
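
For reference, the string counts quoted above can be converted to percentages with a few lines of arithmetic. The sketch below uses only the counts given in the text; the labels and the helper name `string_rate` are hypothetical.

```python
TALKERS = 6
SEQUENCES_PER_TALKER = 80
TOTAL_STRINGS = TALKERS * SEQUENCES_PER_TALKER  # 480 test strings

# Correct-string counts quoted in the text.
CORRECT = {
    "single warp, single candidate string": 429,
    "four warps, multiple candidate strings": 442,
    "four warps, single candidate string": 430,
}


def string_rate(correct, total=TOTAL_STRINGS):
    """String recognition rate in percent."""
    return 100.0 * correct / total


if __name__ == "__main__":
    for label, correct in CORRECT.items():
        rate = string_rate(correct)
        print(f"{label}: {correct}/{TOTAL_STRINGS} = {rate:.1f} percent")
```

The three rates come to about 89.4, 92.1, and 89.6 percent, so the single-warp system differs from the four-warp, single-candidate system by one string in 480.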

IV. CONCLUSIONS 

We have shown that dynamic time warping algorithms can be 
efficiently applied to both word spotting and connected word recogni- 
tion. We have demonstrated the relative performance superiority of 
the local minimum dtw algorithm over the fixed-range dtw algorithm. 
It was also shown that the beginning regions of successive applications
of the local minimum dtw algorithm need not overlap to achieve
accuracy comparable to that obtained with overlapping beginning
regions. We have found that, for small beginning regions (small Δ), a
single local-minimum time warp [with ε = (Δ - 1)/2, NTRY = 1] was as
accurate as (and more computationally efficient than) any combination
of the parameters ε, δ, and NTRY. Finally, we found that an extremely
simple connected digit recognition system, i.e., a single local-minimum
time warp per word using only one candidate string, achieved a string
recognition rate of nearly 90 percent.